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Abstract 

Despite the prevalence of intron losses during eukaryotic evolution, the selective forces acting on them have not been extensively 
explored. /4rab/c/ops/s thaliana lost half of its genome and experienced an elevated rate of intron loss after diverging from A. lyrata. The 
selective force for genome reduction was suggested to have driven the intron loss. However, the evolutionary mechanism of genome 
reduction is still a matter of debate. In this study, we found that intron-lost genes have high synonymous substitution rates. Assuming 
that differences in mutability among different introns are conserved among closely related species, we used the nucleotide substi- 
tution rate between orthologous introns in other species as the proxy of the mutation rate oi Arabidopsis introns, either lost or extant. 
The lost introns were found to have higher mutation rates than extant introns. At the genome-wide level, A. thaliana has a higher 
mutation rate than/A. lyrata, which correlates with the higher rate of intron loss and rapid genome reduction of A. thaliana. Our results 
indicate that selection to minimize mutational hazards might be the selective force for intron loss, and possibly also for genome 
reduction, in the evolution oi A. thaliana. Small genome size and lower genome-wide intron density were widely reported to be 
correlated with phenotypic features, such as high metabolic rates and rapid growth. We argue that the mutational-hazard hypothesis 
is compatible with these correlations, by suggesting that selection for rapid growth might indirectly increase mutational hazards. 
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Introduction 

Spliceosomal introns are unevenly distributed among different 
organisms and among different genes of the same organism 
(Jeffares et al. 2006; Roy and Gilbert 2006; Rogozin et al. 
2012). Vertebrates and plants have hundreds of thousands 
of introns in their genomes, whereas the tiny genome of 
the nucleomorph Hemiselmis andersenii has no introns at all 
(Lane et al. 2007). Among the genes of intron-rich organisms 
(such as humans), some genes are intronless, whereas some 
others each have more than 100 introns. Accumulating evi- 
dence indicates that the early eukaryotes were intron rich (Roy 
and Gilbert 2005a; Csuros et al. 2011). The differences in 
intron density can be explained mainly by the different rates 
of intron loss and partially by the different rates of intron gain 
(Roy 2006; Rogozin et al. 2012). In principle, the different 
rates of intron loss and gain might be caused either by muta- 
tional differences in removing old introns or generating new 
introns, or by differences in selective pressures for or against 
intron accumulation. However, only a few studies have 



attempted to explore the selective forces acting on intron 
gain and loss (Llopart et al. 2002; Lynch 2002; Lynch and 
Conery 2003; Lane et al. 2007; Stajich et al. 2007; Niu 2008). 

Having introns does confer some benefits. Some introns 
can expand protein diversity through alternative splicing 
(Kalsotra and Cooper 201 1). Some introns or elements con- 
tained in introns regulate gene expression (Le Hir et al. 2003; 
Wang et al. 2007; Rose et al. 2008; Parenteau et al. 201 1; 
Rearick et al. 201 1). The selective pressure on the loss or gain 
of these introns is obvious. However, there is no direct evi- 
dence of the proportions of introns that are actually involved 
in these functions. Surveys of nucleotides subject to purifying 
selection indicate that functional sequences do not exceed 
12% of the human genome (Lindblad-Toh et al. 2011; 
Ponting and Hardison 2011). Recently, the majority of the 
human genome was found to have biochemical activity, 
which was considered as evidence against the existence of 
junk DNA in the large genome (Ecker et al. 2012; Pennisi 
201 2; ENCODE Project Consortium 201 2). However, this con- 
clusion has been criticized to be rather farfetched (Eddy 201 2; 
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Niu and Jiang 201 3). Other possible benefits of having introns 
that do not depend on specific sequences have been pro- 
posed (Forsdyl<e 1981; Fedorova and Fedorov 2005; Niu 
2007; Niu and Yang 201 1). They could, in principle, be applied 
to all spliceosomal introns. However, no convincing evidence 
implicates the sequence-independent benefits as active selec- 
tive forces in intron evolution. 

There is also a nonadaptive view of the evolution of 
genome complexity, including the accumulation of introns 
(Lynch 2002, 2007a, 2007b; Gray et al. 2010). Most introns 
are considered as nearly neutral but slightly deleterious. The 
absence of introns in prokaryotes and the scarcity of introns in 
some eukaryotes are attributed to efficient removal of introns 
by purifying selection. In vertebrates, smaller population sizes 
are thought to have relaxed the purifying selection against 
introns. An advanced version of the nonadaptive view of non- 
coding sequence evolution is the mutational-hazard hypothe- 
sis (Lynch 2006, 2007b; Lynch et al. 2006). More noncoding 
DNA is seen as more likely to accumulate deleterious muta- 
tions. The length of the DNA and the mutation rate determine 
the selective burden of carrying the surplus DNA. The evolu- 
tionary fate of the surplus DNA is determined by the efficiency 
of selection, which is mainly determined by effective popula- 
tion size. This hypothesis is more accessible and has attracted a 
great deal of attention; however, it is still a matter of debate 
(Lynch and Conery 2003; Whitney and Garland 2010; 
Whitney et al. 2010, 2011; Boussau et al. 2011; Lynch 
201 1; Kelkar and Ochman 2012). 

In recent years, intron loss has been associated with genome 
reduction. The highly compacted genome of the nucleomorph 
H. andersenii has lost all its introns (Lane et al. 2007). The plant 
pathogen Ustilago maydis has a rather small genome com- 
pared with related, sequenced fungi. Comparative analysis re- 
vealed that massive intron loss had contributed to the genome 
reduction (Kamper et al. 2006). Within 10 Myr, Arabidopsis 
thaliana lost almost half of its genome (Hu et al. 201 1; Proost 
etal. 201 1). Consistent with this rapid genome reduction, a six- 
time higher rate of intron loss in A. thaliana compared with its 
relative/^, lyrata was reported. Introns make a huge contribu- 
tion to the genome size; therefore, it is quite likely that the 
selective force for genome reduction acts as a selective force 
against intron accumulation (Fawcett et al. 2012). 

In both plants and animals, small genomes are associated 
with many phenotypes, including small nuclei, small cells, 
short cell cycles, high metabolic or photosynthetic rates, 
rapid growth, and short generation time (Gregory 2002, 
2005; Cavalier-Smith 2005; Knight et al. 2005; Dufresne 
and Jeffery 201 1). However, the molecular mechanism under- 
lying the selective force for genome reduction remains unclear 
(Knight et al. 2005; Dufresne and Jeffery 2011; Lynch et al. 
2011). Both selection for metabolic, temporal, and spatial 
economy and selection to minimize mutational hazards 
might have constrained genome sizes. In this study, we 
found that intron loss and genome reduction of A. thaliana 



are significantly associated with high mutation rates, which is 
consistent with the mutational-hazard hypothesis. Further- 
more, we suggest that the mutational-hazard hypothesis 
might underlie the reported correlations between genome 
size and phenotypes. 

Materials and Methods 

Genomes and Transcriptomes 

We downloaded the genome sequences and annotation files 
of A. thaliana from the Arabidopsis Information Resource 
(TAIR10 release, http://vvww.arabidopsis.org/, last accessed 
May 9, 201 1), Thellungiella parvula (synonym: Eutrema parvu- 
lum) from Thellungiella.org (version 2.0, http://Thellungiella 
.org/, last accessed July 27, 2012), and Brassica rapa from 
the Brassica Database (version 1.2, http://brassicadb.org/ 
brad/, last accessed July 27, 2012). The genomes and anno- 
tations for the following organisms were obtained from 
Phytozome (version 7.0, http:/AAww.phytozome.net/, last 
accessed May 12, 2011): A. lyrata (JGI release vl.O), Carica 
papaya (ASGPB release of 2007), Glycine max (JGI Glymal .0 
annotation of the chromosome-based Glymal assembly), 
Medicago truncatula (Release Mt3.0 from the Medicago 
Genome Sequence Consortium), Populus trichocarpa (JGI as- 
sembly release v2.0, annotation v2.2), and Vitis vinifera 
(March 201 0 1 2X assembly and annotation from Genoscope). 

Following the recommendations of a previous study 
(Jeffares et al. 2008), we downloaded the "Stress Series" 
data related to the AtGenExpress Project from the microarray 
database of the Nottingham Arabidopsis Stock Center (http:// 
arabidopsis.info/, last accessed October 9, 201 1). The data set 
includes genome-wide expression data of A. thaliana under 
cold, osmotic, salt, drought, genotoxic, oxidative, UV-B, 
wounding, and heat stress conditions, as well as control con- 
ditions. The median value of all control plant data points of a 
gene was used to represent the gene expression level. Genes 
that could quickly change their rate of transcription were as- 
sumed to have a low time cost. The speed of gene expression 
regulation was defined as the maximum rate of transchptional 
change per unit of time (Jeffares et al. 2008). 

Identification of Orthologous Genes 

Initially, genes were filtrated out for obvious annotation errors, 
such as coding sequences that were not a multiple of three 
nucleotides. For the genes with alternatively spliced isoforms, 
the longest mRNA was used for analysis. Using the best recip- 
rocal Basic Local Alignment Search Tool (BLAST) hits, the 
homologous proteins between A. thaliana and A. lyrata 
were detected with thresholds of E values < 10"^°. A total 
of 21,158 groups of homologous proteins were obtained. 
Among these, 2,187 homologous groups consisting only of 
intronless genes were excluded, leaving 18,971 intron- 
containing homologous groups. 
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Recent genome analysis revealed several rounds of genome 
duplication and subsequent massive loss of genes in the evo- 
lution of Arabidopsis (Proost et al. 201 1 ). In cases where asym- 
metric losses of paralogs have occurred between A. thaliana 
and A. lyrata, the best hits of reciprocal best BLAST represent 
homoeologs, rather than true orthologs. For this reason, the 
above intron-containing homologous groups were filtered by 
SynMap (Lyons et al. 2008). We have run our own analysis in 
SynMap with its recommended or default settings. For two 
settings that lack recommendations, we used BLASTN as the 
BLAST algorithm and Quota align algorithm to enforce a 1:1 
syntenic depth between genomes. Ultimately, 1 6,266 homol- 
ogous groups were found in conserved synteny blocks and 
regarded as orthologs. 

Identification of Intron Loss and Gain 

First, the orthologous proteins were aligned using ClustalW 
(version 2.1) (Larkin et al. 2007). With the aligned protein 
segments as markers, the full-length DNA sequences of the 
orthologous genes were aligned using ClustalW (version 2.1). 
These two steps were repeated using MUSCLE (version 
3.8.31) (Edgar 2004). All conflicting results were checked 
manually. Before determining intron loss or gain, some align- 
ments were manually improved. 

Referring to previous studies (Roy and Penny 2006a, 2007; 
Zhang et al. 2010), the alignments were filtered. An intron 
position was discarded if 1) it is too close to the gene ends, 
with less than 45 nucleotides flanking either side, or 2) its 
flanking exon sequence has an identity lower than 0.68. 
The identity was calculated by counting 45 bp either side of 
an intron position. The value 0.68 represents the first quintile 
of the identities of all the aligned A. thaliana and A. lyrata 
orthologous mRNAs. In total, 2,534 positions that differ in 
the presence/absence of introns were observed between 
A. thaliana and A. lyrata. Seven species (B. rapa, C. papaya, 
G. max, P. trichocarpa, T. parvula, M. truncatula, and V. vinif- 
era) were then used as outgroups to distinguish intron losses 
from intron gains (fig. 1). In Arabidopsis, two previous studies 
gave conflicting results on the relative frequency of intron loss 
and gain (Knowles and McLysaght 2006; Fawcettetal. 2012). 
However, in other eukaryotic lineages, most studies consis- 
tently indicated that intron loss greatly outnumbers intron 
gain (Roy and Gilbert 2005b, 2006b, 2007; Coulombe- 
Huntington and Majewski 2007a, 2007b; Stajich et al. 
2007; Csuros et al. 201 1). In this study, a more conservative 
criterion was used to define intron gain than to define intron 
loss. Dollo parsimony was used to define intron gains, 
whereas standard parsimony was used to define intron 
losses. The possible underestimation of the intron gain rate 
would affect both A. thaliana and A. lyrata simultaneously; 
therefore, it would not affect comparisons between A. thali- 
ana and A. lyrata. In this way, 132 putative intron losses and 



r- Arabidopsis thaliana 
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Fig. 1. — Phylogenetic tree used to distinguish intron loss and gain in 
Arabidopsis thaliana and A. lyrata. The tree was constructed using the 
phylogenetic tree from Phytozome (v8.0, http:/AAA/vw.phytozome.nef, 
last accessed January 20, 2012) and is not scaled according to phyloge- 
netic distances. Dollo parsimony was used to define intron gains. That is, 
an intron gain in Arabidopsis was categorized when there were no introns 
in the position of the orthologous genes of any outgroup species. 
Meanwhile, there should be at least two outgroup branches that definitely 
showed absence of the intron. Standard parsimony was used to define 
intron losses. That is, an intron should be present more often than it is 
absent in the outgroup branches to define an intron loss. In cases where 
two or three species of the same outgroup branch differ in the absence/ 
presence of an intron, the branch was not referred to as defining intron 
loss. Full lists of the presence, absence, and uncertainty of introns in the 
orthologous genes of the nine species are available in supplementary 
tables S5 and S6, Supplementary Material online. 

no intron gains were detected in A. thaliana, and 35 putative 
intron losses and 55 putative intron gains in A. lyrata. 

In a recent study, Fawcett et al. (2012) found 90 intron 
losses and two intron gains in A. thaliana and 15 intron 
losses and nine intron gains in A. lyrata. The nonoverlapping 
results were manually checked and filtered by SynMap (Lyons 
et al. 2008). Two intron losses for^. thaliana and one intron 
loss and one intron gain for A. lyrata were integrated from 
their results. We also examined the intron losses and gains 
observed by Knowles and McLysaght (2006). No convincing 
nonoverlapping results were observed. Therefore, 134 puta- 
tive intron losses in A. thaliana and 36 putative intron losses 
and 56 putative intron gains in A. lyrata remained. 

A simple insertion in exonic sequence might be misanno- 
tated as an intron and consequently misidentified as an intron 
gain. Similarly, if an exonic segment was misannotated as an 
intron in orthologous genes, a simple deletion of the segment 
would result in misidentification of an intron loss. Therefore, 
we verified the annotations of the introns related to the intron 
losses and gains using transcriptome data. The expressed se- 
quence tag (EST) data of A. thaliana, A. lyrata and the out- 
group species were retrieved from dbEST (Expressed Sequence 
Tags database, http:/AAAAAA/.ncbi. nlm.nih.gov/dbEST/, last 
accessed November 13, 2012), and the RNA-Seq reads data 
of A. thaliana (ERP001616) (Manavella et al. 2012), A. lyrata 
(SRP004429) (Hollister et al. 201 1), and B. rapa (ERR037339) 
(Harper et al. 2012) were downloaded from the Sequence 
Read Archive (http://www.ncbi.nlm.nih.gov/sra, last accessed 
November 19, 2012). These ESTs and reads were mapped to 



Genome BioL EvoL 5(4):723-733. doi:10.1093/gbe/evt043 Advance Access publication March 20, 2013 



725 



Yang et al. 



GBE 



the genome using BLAT (BLAST-like Alignment Tool) (Kent 
2002) and TopHat 2.0.5 (Trapnell et al. 2009). For an intron 
loss to be confirmed, we must first confirm that the lost se- 
quence is an intron. So the extant orthologous introns must 
be actively transcribed and spliced correctly in at least one 
intron-extant species. By consulting the transcriptome data, 
14 putative intron losses in A. thaliana were discarded due 
to lack of transcriptional information, and 6 putative intron 
losses also in A. thaliana were discarded because the introns 
were not spliced out in other species. In addition, we checked 
whether the intron-lost (IL) genes are still active after losing the 
introns by examining the transcripts they produced. We could 
not find any evidence of transcription for only one IL gene in 
A. lyrata. Furthermore, transcriptome data covered 27 of the 
56 putative intron gains in A. lyrata. And among these 27 
cases, 20 introns were found to be retained in transcripts 
and thus discarded from our data set. For the remaining 
seven intron gains, the active transcription of their ortholo- 
gous genes \nA. thaliana was also confirmed by transcriptome 
data. In total, 114 intron losses from A. thaliana, 35 intron 
losses from A. lyrata, and 7 intron gains in A. lyrata were 
supported by transcriptome data (table 1). 

Calculating Nucleotide Substitution Rates 

A previous study showed that, except for the two ends, most 
intron sequences in plants are not constrained (Guo et al. 
2007). Nucleotide substitution rates of internal regions of in- 
trons could approximately reflect mutation rates. Based on 
this study, 10 nucleotides were removed from both the 
5'- and 3'-ends of each intron before the intronic substitution 
rates were calculated. 

Alignment artifacts could increase the calculated diver- 
gence, especially in alignments of noncoding sequences. To 
avoid these potential errors, Gblocks (Castresana 2000) was 
used to detect and filter unreliable alignment regions. 
However, too-strict filtration of alignments by Gblocks would 
delete regions with high frequencies of mutations, which in 
turn would make the calculated substitution rate lower than 
the actual mutational rates. When comparing different 
genes, the differences in mutation rate would become artifi- 
cially smaller, even becoming statistically insignificant in some 
cases. Thus, a compromise was made. The intron sequence 
alignments were filtered using Gblocks (version 0.91b) 
(Castresana 2000). Specifically for noncoding sequence align- 
ments, the minimum length of a block was adjusted to 5. 



Table 1 

Arabidopsis thaliana Has Undergone More Intron Losses and Fewer 
Intron Gains than A. lyrata 





A thaliana 


A lyrata 


Pearson Test 


Lost introns 


114 


35 


p=2x ^o-^ 


Gained introns 


0 


7 




Conserved introns 


80,262 


80,262 





The maximum number of contiguous nonconserved positions 
(fa) was tested using 1, 2, 3, 4, and 8. The detected unreliable 
alignment regions were discarded. To seek an appropriate fa 
value for the program Gblocks, we compared the nucleotide 
substitution rate of introns with well-aligned control se- 
quences. Coding sequences are much easier to align because 
of the conservation of amino acid sequences. Initially, the 
orthologous coding sequences were aligned and filtered by 
Gblocks with its default parameters. Then, all the third sites 
of 4-fold degenerate codons were extracted with their relative 
positions in the alignments. These extracted synonymous bases 
were used as control sequences. To reduce the effects of 
random noise, only alignments longer than 30 bp were re- 
tained for calculating the substitution rate. The nucleotide sub- 
stitution rate of intron sequences (cfj) and the control sequences 
(dc) was estimated using the algorithmic method of Tamura 
and Nei (1993), implemented in PAUP 4.0 beta. In rice, the 
synonymous base substitution rate is half that of transposable 
elements (Gaut et al. 1996; Ma and Bennetzen 2004). The 
nucleotide substitution rates in transposable elements are 
often used as a neutral standard to represent the mutation 
rates (Gaffney and Keightley 2006). Therefore, if the d, calcu- 
lated in this study is more than twice the d^, it is likely to have 
been overestimated because of unreliable alignments. By scru- 
tinizing the ratios of d/d^ under different values of fa (supple- 
mentary table SI, Supplementary Material online), fa = 2 was 
chosen for Gblocks. When fa = 2, d/d^ is < 0.8 when compar- 
ing A. thaliana, A. lyrata, B. rapa, and T. parvula. All the analyses 
of dj were repeated using fa = 4 for Gblocks and by PAML4.2b 
(Yang 2007). Similar results were obtained (data not shown). 

The synonymous substitution rates (ds) of coding se- 
quences were calculated by PAML 4.2b (Yang 2007). 
Gblocks with its default parameters was also used to filter 
the alignments of orthologous coding sequences. 

Detection of Regulatory Elements in Introns 

CpG islands were detected by the NEWCPGREPORT program 
with its recommended settings (http://emboss.open-bio.org/ 
wiki/Appdoc:Newcpgreport, last accessed August 17, 2012). 
The total number and total length of CpG islands present in 
each intron were counted. The density of CpG islands within 
an intron was defined as the number of CpG islands and the 
length of CpG islands divided by intron length. 

The IMEter algorithm, with its default parameters, was 
used to predict the ability of an intron to enhance gene ex- 
pression (Rose et al. 2008). 

Results 

Arabidopsis thaliana Lost More Introns and Gained Fewer 
Introns Than A. lyrata 

Similar to Fawcett et al. (201 2), the sites that differ in terms of 
the presence and absence of introns between A. thaliana and 
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A. lyrata were detected by comparing their orthologous 
genes. In most cases, intron loss and gain could not be distin- 
guished because of the absence of reference genes in the 
outgroup species. Therefore, more outgroup species were 
added to form a larger data set of intron loss and gain that 
helped the statistical analysis. Using the orthologous genes 
from seven outgroup species (fig. 1), 132 losses and no 
gains in A. thaliana and 35 losses and 55 gains in A. lyrata 
were obtained from 2,534 loss/gain events. After integrating 
some cases from the data set of Fawcett et al. (2012) and 
filtering out uncertain cases, a larger data set, comprising 1 14 
intron losses in A. thaliana and 35 intron losses and seven 
intron gains in A. lyrata (table 1), was obtained. With more 
and more closely related genomes being sequenced, which 
could be used as outgroup species, we expect that more 
intron losses and gains will be identified in A. thaliana and 
A. lyrata, which might make the following results more 
significant. 

Arabidopsis thaliana lost more introns but gained fewer 
introns than A. lyrata. Fawcett et al. (2012) used the selective 
force for genome reduction of A. thaliana to explain these 
differences. However, the evolutionary mechanisms underly- 
ing the genome reduction are in dispute (Knight et al. 2005; 
Lynch etal. 201 1). Both selection for metabolic, temporal, and 
spatial economy and selection to minimize mutational hazard 
might have promoted genome reduction (Cavalier-Smith 
2005; Knight et al. 2005; Lynch et al. 201 1). Using the data 
sets in Arabidopsis, we tested the mutational-hazard hypoth- 
esis of intron loss and deduced its influence on the mechanism 
of genome reduction. It should be noted that the seven gains 
in A. lyrata is too small a sample for statistical analysis. 

Synonymous Sites of IL Genes Have Higher Mutation 
Rates 

Within a genome, the mutation rate varies greatly across 
genes (Gaut et al. 2011). According to the mutational- 
hazard hypothesis, genes in mutational hot spots have 
higher hazards and thus experience stronger selective forces 
to purge surplus sequences. If mutational hazard was the se- 
lective force for intron loss, genes with higher mutation rates 
would be more lil<ely to lose their introns. Using the synony- 
mous substitution rate between A. thaliana and A. lyrata (dsn) 
as a proxy for mutation rate, we found that IL genes have 
significantly higher mutation rates than other genes 
(P= 0.002, fig. 2A). In this article, only genes that have defi- 
nitely not lost or gained any introns in Arabidopsis were used 
as control genes and were conveniently described as "other 
genes." 

Exon sequences flanking introns often contain splicing sig- 
nals, so their synonymous sites are under selective constraints 
(Parmley and Hurst 2007; Parmley et al. 2007; Warnecke et al. 
2008). Loss of introns would eliminate such constraints and in 
consequence increase the synonymous substitution rate (ds). 
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Fig. 2. — Arabidopsis IL genes liave higlier synonymous substitution 
rates. The 10th to 90th percentiles of the data are presented. (A) The IL 
genes have significantly higher dsti compared with other genes (n= 143 
and 4,706, respectively; Mann-Whitney Utest, P= 0.002). (6) The coding 
sequences flanking lost introns have significantly higher dsti than those 
flanking conserved introns (n=149 and 14,126, respectively; IVlann- 
Whitney U test, P=3 x 10"''). (C) The coding sequences flanking the IL 
position also have higher dsrp than those flanking conserved introns 
(n=116 and 12,589, respectively; Mann-Whitney U test, P=0.002). 
Coding sequences within 100 bp of both the 5' and 3' sides of an 
intron were defined as sequences flanking the intron. Using 200 bp and 
400 bp to define flanking sequences gave similar results (data not shown), 
dsti, the synonymous substitution rate between Arabidopsis thaliana and 
A. lyrata. dsrp, the synonymous substitution rate between Brassica rapa and 
Thellungiella pan/ula. 
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The observed higher dsti of IL genes may either reflect intrinsic 
higher mutation rates or result from abandoned splicing sig- 
nals. If abandon of splicing signals is the main cause, we could 
expect that coding sequences flanking lost introns in IL species 
would have elevated ds, but their orthologous sequences of 
intron-retained species would not be affected. In contrast, if a 
higher mutation rate is an intrinsic feature of IL genes, both 
coding sequences flanking lost introns in IL species and their 
orthologous sequences of intron-retained species should have 
a higher ds. As shown in figure 26, we found that the coding 
sequences flanking lost introns have significantly higher dsti 
than those flanking conserved introns (P= 3 x 1 0^^). Similarly 
in B. rapa and T. parvula, the coding sequences flanking the 
Arabidopsis IL position also have higher ds (P= 0.002, fig. 20- 
Therefore, the higher ds of Arabidopsis IL genes reflect higher 
intrinsic mutation rates. 

Lost Introns Have Higher Mutation Rates 

According to the mutational-hazard hypothesis, introns with 
higher mutation rates are more likely to be eliminated. 
Without the exact sequences of the lost introns, it is impossible 
to directly compare the substitution rate between lost introns 
and consen/ed introns. With the assumption that the differ- 
ences in mutability among different introns are conserved in 
closely related species, we could examine whether the ortho- 
logous introns of the lost introns have higher d, values than 
those of conserved introns. 

We first tested the assumption of mutability conservation 
by analyzing conserved introns. In 43,894 groups of ortho- 
logous introns, the substitution rates between A. thaliana 
and A. lyrata (dn) and between B. rapa and T. parvuia (drp) 
were calculated. Spearman's correlation analysis showed that 
djti and d„p are significantly correlated (rho = 0.110, 
p=4x 10"^"). 

Then, we used the d.rp to represent the intron mutation 
rates of Arabidopsis. As shown in figure 3, the lost introns 
have significantly higher mutation rates than the conserved 
introns of either the same genes or other genes. 

Many introns contain regulatory elements (Rose et al. 
2008; Rearick et al. 2011), their nucleotide substitute rates 
are naturally expected to be lower than their mutation rates. 
Losses of these introns are selected against. The higher dj^p of 
lost introns may result from their paucity of regulatory ele- 
ments. It is impossible to recognize all the possible regulatory 
elements, and absolutely confident results on mutation rates 
are hard to obtain. After surveying two common kinds of 
regulatory elements, we found it to be unlikely that they 
affect our conclusion that lost introns have higher mutation 
rates. CpG islands were found to be enriched in the first 
introns of rodents and likely to regulate gene expression 
(Chamary and Hurst 2004). However, we found that the 
densities of CpG islands are not correlated with d-„p in either 
B. rapa or T. parvula (supplementary table S2, Supplementary 
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Fig. 3. — Lost introns oi Arabidopsis liave liiglier substitution rates. The 
1 0th to 90th percentiles of the data are presented. The lost introns (n = 80) 
have significantly higher djrp compared with conserved introns of the same 
genes (n = 633; Mann-Whitney U test, P= 0.004) and those of other 
genes (n= 12,182; Mann-Whitney U test, P= 0.003). d,,^ is the nucleo- 
tide substitution rate between Brassica rapa introns and Thellungiella par- 
vula introns. 

Material online). In plants, promoter-proximal introns contain 
dispersed signals to enhance gene expression, termed intron- 
mediated enhancement (IME) signals (Rose et al. 2008). If in- 
trons containing these signals are conserved, selection against 
their loss would result in the pattern that conserved introns 
have lower nucleotide substitution rates than lost introns. An 
IMEter score was designed to predict the ability of introns to 
enhance plant gene expression (Rose et al. 2008). Unexpect- 
edly, we found that the IMEter score is positively correlated 
with djrp (supplementary table S3, Supplementary Material 
online). It seems that consen/ed regulating motifs do not 
occupy a large percentage of the nucleotides in introns, and 
thus their existence does not reduce the overall substitution 
rate of introns. We accept that losses of introns containing 
regulatory elements are selected against, and only introns with 
fewer or no regulatory elements are free to be lost in evolu- 
tion. However, the higher d|rp of lost introns we observed 
could not be explained by paucity of regulatory elements 
but could be explained by higher mutation rates. 

Higher Mutation Rate Coincides with More Intron Loss 
Globally 

Compared with A. lyrata, A. thaliana not only compacted its 
genome globally but also lost more introns and gained fewer 
introns (table 1). If the mutational-hazard hypothesis accounts 
for the selective force of intron and genome size evolution, 
A. thaliana should have a globally higher mutation rate than 
A. lyrata. A previous analysis of the internal transcribed spacer 
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sequences of nuclear ribosomal DNA suggested a higher mu- 
tation rate in A. thaliana than in A. lyrata (Soria-Hernanz et al. 
2008). In this study, we calculated the A. thaliana - B. rapa 
substitution rates (d|tr for intron sequences and dstr for synon- 
ymous sites) and the A. lyrata - B. rapa substitution rates 
(c/|ir for intron sequences and dsir for synonymous sites). The 
differences between d|tr and dyr and between ds^ and dsir 
could reflect the differences of mutation rates between 
A. thaliana and A. lyrata because the same reference se- 
quences were used. Wilcoxon signed ranks testing showed 
that A. thaliana has a significantly higher substitution rate 
than A. lyrata (table 2). Furthermore, we performed a relative 
rate test to confirm this difference using RRTree (Robinson- 
Rechavi and Huchon 2000). For ds, the evidence for higher 
mutation rates in A. thaliana than in A. lyrata is obvious. In 
6,91 7 of the 1 1 ,257 comparisons, A. thaliana has a higher ds 
than A. lyrata. Among these 6,91 7 comparisons, 627 showed 
significant differences (P<0.05). In contrast, A. lyrata has a 
higher ds in 4,338 comparisons; among them, 197 are signif- 
icant (P<0.05). For di, A. thaliana also appears to have a 
higher mutation rate than /A. lyrata. Consistent results were ob- 
tained in 33,083 of the 64, 1 33 comparisons, with 1 ,974 com- 
parisons being significant (P<0.05). In contrast, inconsistent 
results were obtained less frequently, 27,854 of 64,133 com- 
parisons with 1,269 comparisons being significant (P< 0.05). 

Negative Correlation between Intron Number and 
Mutation Rate 

If genes with higher mutation rates are more likely to lose 
introns, and intron loss dominates intron gain, one could 
expect that genes with higher mutation rates tend to have 
fewer introns. Using the nucleotide substitution rate between 
conserved introns as a proxy for mutation rate, we found that 
the mutation rate was negatively correlated with intron 



Table 2 

Arabidopsis tlialiana Has a Higher Global Mutation Rate than A. lyrata^ 



Percentile 




rfilr 


dsn 




10 


0.174 


0.170 


0.317 


0.305 


20 


0.211 


0.205 


0.353 


0.341 


30 


0.239 


0.233 


0.381 


0.371 


40 


0.266 


0.260 


0.410 


0.397 


50 


0.293 


0.288 


0.439 


0.426 


60 


0.325 


0.319 


0.471 


0.456 


70 


0.360 


0.356 


0.511 


0.495 


80 


0.408 


0.404 


0.571 


0.551 


90 


0.481 


0.478 


0.672 


0.654 



Note. — c^itr, nucleotide substitution rate between A thaliana introns and B. rapa 
introns; dur, nucleotide substitution rate between A. lyrata introns and Brassica rapa 
introns; c/str, synonymous substitution rate between A. thaliana genes and B. rapa 
genes; dsi,, synonymous substitution rate between A lyrata genes and 6. rapa genes. 
The median values (i.e., the 50th percentiles) are highlighted in bold. 

Wilcoxon signed ranks test showed that d^r is significantly higher than d]\t 
{57,008 pairs of samples were compared, P— 7x 10"^^) and dst, is significantly 
higher than ds\r (13,208 pairs of samples were compared, P— 5 x 10""^). 



number per gene in both A. thaliana and A. lyrata 
(Spearman's rho = -0.170, P=2x10-^^ n= 14,139 in 
A. thaliana and Spearman's rho = -0.171, P=3x10"^^, 
n= 14,139 in A. lyrata). In addition, mutation rate and 
intron density (i.e., the number of introns per kilobase of 
mRNA) are also negatively correlated in these two species 
(Spearman's rho = -0.055, P=8x10"", n=14,105 in 
A. thaliana and Spearman's rho = -0.027, P= 0.001, 
n= 14,139 in A lyrata). These negative correlations are still 
significant when expression level and GC content were con- 
trolled (supplementary table S4, Supplementary Material 
online). Similarly, Yang and Gaut (201 1) found that ds is neg- 
atively correlated with intron number per gene in Arabidopsis. 

Although the P values of these correlations are very small, 
we do not think that this is strong evidence for the mutational- 
hazard hypothesis. The total extant introns are the results of 
long-term evolution of intron gain and loss, whereas the mea- 
sured mutation rates are recent evolutionary features. It is 
unclear whether the differences in mutation rates among dif- 
ferent genes are conserved in long-term evolution. In contrast, 
the association of recent intron loss with recently high muta- 
tion rate is more likely to reflect intrinsic causal effects. 

Discussion 

Spliceosomal introns are a common feature of eukaryotic 
genes. Their density varies greatly across genomes, as well 
as among genes of the same genome. However, the evolu- 
tionary mechanisms that control the gain and loss of introns 
are not clear. The model plant /\. thaliana provides an oppor- 
tunity to explore the problem. Within 10 Myr, A. thaliana has 
lost about half of its genome (Proost et al. 201 1). It has also 
lost more introns and gained fewer introns than its close rel- 
ative A. lyrata (Fawcett et al. 2012). The selective force that 
has driven this genome reduction has been proposed as a 
force favoring intron losses. Certain ideas can be borrowed 
directly from another well studied but still highly debated sub- 
ject, the evolution of genome size. In this study, we mainly 
tested whether the mutational hazard hypothesis could be 
used to explain the pattern of intron loss. This hypothesis pro- 
poses that noncoding sequences have slightly deleterious ef- 
fects on fitness because of the hazard of accumulating 
deleterious mutations (Lynch 2006, 2007b; Lynch et al. 
2006). According to this hypothesis, selection to minimize 
the mutational hazard would preferentially remove surplus 
DNA from genomes and genes with high mutation rates. 
Consistently, we found that IL genes have higher mutation 
rates than other genes, and lost introns have higher mutation 
rates than conserved introns. Furthermore, we found that 
A. thaliana has a higher genome-wide mutation rate than 
A. lyrata. 

According to the principle of population genetics, the effi- 
ciency of selection in removing slightly deleterious mutations 
depends heavily on the effective population size (A/g). 
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Organisms with a larger A/e are expected to purge noncoding 
sequences more effectively than those with a smaller A/e 
(Lynch and Conery 2003). Unfortunately, the difference in 
A/e between A. thaliana and A. lyrata is not monotonic. 
Historically, A. thaliana had an Ng about four times of 
A. lyrata; however, its A/e is now much smaller than that of 
A. lyrata (Lundemo et al. 2009; Falahati-Anbaran et al. 201 1; 
Gomaa et al. 201 1). The differences in the rates of intron loss 
and gain between A. thaliana and A. lyrata are consistent with 
the difference in historical A/g. If most introns are slightly del- 
eterious, as assumed in the nonadaptive view, there should be 
a significant decline in the rate of intron loss and an expansion 
of genome size during the most recent evolution of A. thali- 
ana. This prediction may be tested with the numerous natural 
accessions of A. thaliana currently being sequenced. 

In addition to the deleterious effects of mutational hazards, 
metabolic, spatial, and temporal economy might also act as 
selective forces to remove surplus DNA (Cavalier-Smith 2005; 
Knight et al. 2005). According to selection for economy of 
gene expression, highly and quickly expressed genes can be 
expected to experience stronger selective forces for shorter 
introns. Rapidly expressed genes were found to be intron 
poor in organisms from yeasts to humans (Chen et al. 2005; 
Jeffares et al. 2008). However, the fact that highly expressed 
genes have short introns was only observed in animals 
(Castillo-Davis et al. 2002; Li et al. 2007; Carmel and Koonin 
2009). In plants and yeasts, most studies revealed a positive 
correlation between intron size and expression level (Juneau 
et al. 2006; Ren et al. 2006; Li et al. 2007; Jeffares et al. 2008). 
Even in animals, the energetic cost of long introns seems to be 
too small to be efficiently selected against (Huang and Niu 
2008). In spite of this, we tested whether selection for tem- 
poral and energetic economy of gene expression have driven 
intron losses in the evolution of Arabidopsis. If the time cost of 
transcription and splicing of introns had driven the intron 
losses, introns in genes that require rapid changes in their 
rate of expression could be expected to be preferentially 
lost. However, we did not find that IL genes were significantly 
more rapidly expressed than other genes in A. thaliana 
(Mann-Whitney U test, P=0.82). The IL genes have signifi- 
cantly higher expression levels than other genes (Mann- 
Whitney U test, P=0.016). However, highly expressed 
genes tend to have more introns in A. thaliana (Jeffares 
et al. 2008). The highly expressed genes may be more likely 
to lose introns just by chance. For each IL gene, we randomly 
selected a gene with the same number of introns from those 
that have not lost or gained any introns. Then we performed 
Wilcoxon signed-rank test to examine whether IL genes have 
higher level of gene expression. A total of 10,000 rounds of 
random samplings and pairwise tests were carried out. In most 
cases (8,798 rounds), IL genes did not have a significantly 
higher level of expression. Therefore, selection for economy 
of gene expression could not explain the rapid intron losses in 
A. thaliana. There is still no convenient way to explore the 



nuclear space constraint of introns and the selection for econ- 
omy in DNA replication. Further studies are required to deter- 
mine whether nuclear space constraint and selection for 
economy in DNA replication coexist with mutational hazards 
in driving intron loss and genome reduction. 

Genome-wide intron density is positively correlated with 
the generation time of eukaryotes (Jeffares et al. 2006). 
That is, organisms with shorter life cycles tend to have fewer 
introns than slowly growing organisms. More notably, small 
genomes are correlated with many phenotypic features, such 
as small nuclei, small cells, short cell cycles, high metabolic/ 
photosynthetic rates, small seed sizes, rapid growth, and short 
generation time (Cavalier-Smith 2005; Gregory 2005; Knight 
et al. 2005; Dufresne and Jeffery 201 1). We suggest that the 
mutational-hazard hypothesis does not necessarily conflict 
with such correlations. Instead, the selection to minimize mu- 
tational hazards could be an alternative explanation for such 
correlations. Small organisms with rapid growth and repro- 
duction generally have high metabolic rates (Glazier 2010; 
Price et al. 2010). A high rate of metabolism is associated 
with high rates of oxygen consumption and free radical gen- 
eration, which in turn causes more DNA damage. In addition, 
rapid cell division results in higher accumulation of replication 
errors in genomes. Although A. thaliana and A. lyrata are 
closely related, A. thaliana is annual, whereas A. lyrata is pe- 
rennial. Arabidopsis thaliana has a significantly higher muta- 
tion rate than A. lyrata (table 2). In both plants and animals, 
many previous studies have shown that organisms with short 
generations have higher substitution rates than organisms 
with longer generations (Gaut et al. 1996, 2011; Li et al. 
1996; Nikolaev et al. 2007; Soria-Hernanz et al. 2008; 
Muller and Albach 2010). Thus, selection for rapid growth 
indirectly increased mutational hazards, which in turn might 
act as a selective force to remove surplus DNA. 

In addition to selection to minimize mutational hazards, 
there is another possible, but less likely, explanation for the 
association between mutation rate and intron loss: mutation 
bias. Similar to the pattern of intron loss we observed in 
Arabidopsis, Magee et al. (2010) found that genes in the 
hypermutable regions of legume chloroplast genomes are 
preferentially lost and relocated to the nucleus. The authors 
did not invoke selective pressure but mutation bias. Could our 
observation of intron loss be explained simply by mutation 
bias? Under certain conditions, genes experience higher fre- 
quencies of both double-strand breaks (DSBs) and point mu- 
tations (Shee et al. 2011; Kim and Jinks-Robertson 2012). If 
DSB repair was the predominant mechanism of both intron 
loss and intron gain, as recently suggested (Li et al. 2009; 
Farlow et al. 201 1; Ragg 201 1; Fawcett et al. 2012), intron 
loss and gain would tend to occur simultaneously in hypermu- 
tated genes. This hypothesis and the mutational-hazard hy- 
pothesis both predict that introns with higher mutation rates 
would be preferentially lost. However, these two hypotheses 
differ entirely in their predictions of the rate of intron gain. The 



730 Genome Biol. EvoL S{4y.723-733. doi:10.1093/gbe/evt043 Advance Access publication March 20, 2013 



Loss of Introns from High Mutation Rate Genes 



GBE 



mutation bias hypothesis predicts that genes and genomes 
with higher mutation rates are more likely to gain introns, 
whereas the mutational-hazard hypothesis predicts that they 
are less likely to gain introns. Limited by the small number of 
intron gains observed, we were unable to assess the mutation 
rates of intron-gained genes with any statistical confidence. 
However, these two hypotheses can be distinguished by com- 
parisons at the genome-wide level. Arabidopsis thaliana has a 
higher genome-wide mutation rate than A. lyrata; therefore, 
the mutation bias hypothesis predicts that/4, thaliana would 
have gained more introns, whereas the mutational-hazard 
hypothesis predicts that A. thaliana is less likely to have 
gained introns. Fawcett et al. (2012) reported two intron 
gains in A. thaliana and six intron gains in A lyrata. Using 
more outgroup genomes to distinguish intron loss and gain, 
we found seven cases of intron gain in A. lyrata but no cases of 
intron gain in A. thaliana. If putative gained introns that lack 
support from transcriptome data are also considered, the dif- 
ference in the number of intron gains between A. thaliana and 
A. lyrata becomes much larger: none versus 56. In conclusion, 
the pattern of intron loss and gain observed in A. thaliana and 
A. lyrata is more consistent with the mutational-hazard 
hypothesis. 

Supplementary Material 

Supplementary tables S1-S6 are available at Genome Biology 
and Evolution online (http://vvww.gbe.oxfordjournals.org/). 
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